A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries
Abstract
Design: The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach that combined dictionary look-up, rule-based, and machine-learning methods. Measurements: The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure. Results: The implemented method achieved a macro-averaged F-measure of 81% for the textual task (the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations. Conclusions: The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.
J Am Med Inform Assoc. 2009;16:596–600. DOI 10.1197/jamia.M3096.
Introduction
The objective of the 2008 i2b2 obesity challenge in Natural Language Processing (NLP) for clinical data was to evaluate NLP systems on their performance in identifying patient obesity and associated co-morbidities based on hospital discharge summaries. Fifteen related diseases were considered: Diabetes mellitus (DM), Hypercholesterolemia, Hypertriglyceridemia, Hypertension (HTN), Atherosclerotic CV disease (CAD), Heart failure (CHF), Peripheral vascular disease (PVD), Venous insufficiency, Osteoarthritis (OA), Obstructive sleep apnea (OSA), Asthma, GERD, Gallstones/Cholecystectomy, Depression, and Gout.
Affiliation of the authors: School of Computer Science, University of Manchester, Manchester, UK; Dr. Yang is currently with the Department of Computing, Open University, UK. This work was partially supported by the UK BBSRC project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”. Irena Spasic gratefully acknowledges the support of the BBSRC and EPSRC via “The Manchester Centre for Integrative Systems Biology” grant. Correspondence: Goran Nenadic, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; e-mail: [email protected]. Received for review: 12/07/08; accepted for publication: 04/07/09.
The aim was to label each document with a disease/co-morbidity status, indicating whether:
• a patient was diagnosed with a disease/co-morbidity (Y: yes, disease present),
• a patient was diagnosed as not having a disease/co-morbidity (N: no, disease absent),
• it was uncertain whether a patient had a disease/co-morbidity (Q: questionable), or
• a disease/co-morbidity status was not mentioned in the discharge summary (U: unmentioned).
The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases in the narrative text. Each hospital report was to be labeled using one of four possible disease status labels (Y, N, Q, or U). The intuitive task focused on inferring the disease status even when the evidence was not explicitly asserted. Possible intuitive labels were Y, N, and Q for each disease. The organizers provided a training set of 730 hospital discharge summaries manually annotated with more than 22,000 labels. We implemented a hybrid approach that combined three types of features (lexical, terminological, and semantic), exploited by dictionary look-up, rule-based, and machine-learning methods.
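Because most documents carry the common labels (Y and U) while N and Q are rare, the macro- and micro-averaged F-measures used to score this labeling task can differ widely. The sketch below illustrates the difference using one common definition (per-label F1 averaged for macro, pooled counts for micro); the challenge's exact scoring procedure is not given in this section, and the function name `f_measures` and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def f_measures(gold, predicted, labels):
    """Return (macro_f, micro_f) for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # label p was predicted but is wrong
            fn[g] += 1  # label g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Macro: every label counts equally, however rare it is.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts, so frequent labels dominate.
    micro = f1(sum(tp[l] for l in labels),
               sum(fp[l] for l in labels),
               sum(fn[l] for l in labels))
    return macro, micro


# Toy data (invented): ten documents for one disease, one rare Q missed.
gold = ["U"] * 6 + ["Y"] * 3 + ["Q"]
predicted = ["U"] * 6 + ["Y"] * 3 + ["U"]
macro_f, micro_f = f_measures(gold, predicted, labels=["Y", "Q", "U"])
```

On this toy data the single missed Q document leaves the micro average at 0.90 (which, for single-label classification, equals accuracy) but pulls the macro average down to roughly 0.64, which is why a system can score far higher on the micro-averaged measure than on the macro-averaged one.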
We assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against the manually prepared gold standard. In the textual task, the macro-averaged F-measure of our approach (81%) was the highest achieved in the challenge. In the intuitive task, we achieved a macro-averaged F-measure of 63%. The micro-averaged F-measure showed an average accuracy of 97% for textual annotation and 96% for intuitive annotation, indicating the potential of text mining techniques to accurately extract the disease status from hospital discharge summaries.
Journal of the American Medical Informatics Association, Volume 16, Number 4, July/August 2009
Figure 1. The general design of the system.
Figure 2. The system architecture diagram.
Methods
The general idea underlying our approach was to identify sentences that contained evidence to support a judgment for a given disease, and then to integrate the evidence gathered at the sentence level to make a prediction at the document level. The system workflow consisted of three major steps: report pre-processing, textual prediction, and intuitive prediction, followed by the final integration of the textual and intuitive results (see Figure 1). The prediction steps were applied to each of the 16 diseases/co-morbidities separately. The report pre-processing involved basic textual processing of the input discharge narratives. In the textual prediction step, explicit evidence was identified and combined to derive textual predictions. The intuitive prediction module focused on capturing intuitive clues that could associate the report with the disease. The final intuitive judgments were combined with the textual ones. Figure 2 depicts the detailed architecture of the system.
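The idea of integrating sentence-level evidence into a document-level prediction can be illustrated with a small sketch. The integration rule actually used by the system is not spelled out in this section; the weighted vote below, the section weights, the default weight, and the tie-break order are all assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict

# Hypothetical section weights; the real weights were derived from
# training data and are not listed in this section.
SECTION_WEIGHTS = {
    "Diagnosis": 1.0,
    "Past or Present History of Illness": 0.8,
    "Physical or Laboratory Examination": 0.6,
    "Medication/Disposition": 0.5,
}
DEFAULT_WEIGHT = 0.3            # assumed weight for any other section
TIE_BREAK = ["Y", "N", "Q"]     # assumed preference order when totals tie


def document_label(evidence, weights=SECTION_WEIGHTS):
    """Combine sentence-level labels into one document-level label.

    evidence: list of (sentence_label, section_category) pairs for one
    disease. With no evidence at all the disease is unmentioned ("U").
    """
    if not evidence:
        return "U"
    totals = defaultdict(float)
    for label, section in evidence:
        totals[label] += weights.get(section, DEFAULT_WEIGHT)
    best = max(totals.values())
    for label in TIE_BREAK:
        if totals.get(label) == best:
            return label
    return "U"


example = [
    ("Y", "Diagnosis"),
    ("N", "Medication/Disposition"),
    ("N", "Social/Family History"),
]
label = document_label(example)
```

In this example the single positive mention in the "Diagnosis" section outweighs two negative mentions from lower-weighted sections. A weighted vote is only one plausible design; a rule-based precedence scheme would fit the same description.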
In the following sections we describe each module and the basic steps performed (for further details see the JAMIA online data supplement at http://www.jamia.org).
Report Pre-processing Module
Input discharge summaries were first split into sections using a set of flexible lexical matching rules that identified section titles and classified them into six predefined categories: “Diagnosis”, “Past or Present History of Illness”, “Social/Family History”, “Physical or Laboratory Examination”, “Medication/Disposition”, and “Other”. Section titles were recognized by matching the most frequent title keywords, collected semi-automatically from the training dataset. In addition, each section type was assigned a weight reflecting its predictive capacity for a given disease (see the Training Data Analyses section). The sections were decomposed into sentences using LingPipe. Part-of-speech (POS) tagging and shallow parsing were performed using the GeniaTagger, which is specifically tuned for biomedical text.
Yang et al., Text Mining of Disease Status Predictions
Textual Prediction Module
The main objective of this module was to identify sentences that, given a disease, explicitly mentioned the disease itself and/or associated clinical terms. We lexically profiled each disease by collecting (1) its name and synonyms from public resources including the UMLS, (2) disease sub-classes (e.g., diabetes type II) and their synonyms, (3) disease super-classes (e.g., reflux for GERD and arthritis for OA) and their synonyms, and (4) clinical terms closely related to the disease (e.g., associated symptoms and treatments), imported from public medical resources or selected from the training dataset based on their occurrence statistics.
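A lexical profile of this kind can be pictured as a small data structure with one slot per term source. The sketch below is illustrative only: the class name `DiseaseProfile`, the field names, and every confidence value are hypothetical, and the example terms are limited to those mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DiseaseProfile:
    """Illustrative container for the four kinds of terms in a profile."""
    names: set = field(default_factory=set)          # (1) name and synonyms
    subclasses: set = field(default_factory=set)     # (2) disease sub-classes
    superclasses: set = field(default_factory=set)   # (3) disease super-classes
    related_terms: dict = field(default_factory=dict)  # (4) term -> confidence

    def term_index(self):
        """Map every term to (kind, confidence).

        Confidence 1.0 for non-related terms is an assumption of this
        sketch, not a value taken from the paper.
        """
        index = {}
        for kind, terms in (("name", self.names),
                            ("subclass", self.subclasses),
                            ("superclass", self.superclasses)):
            for term in terms:
                index[term.lower()] = (kind, 1.0)
        for term, confidence in self.related_terms.items():
            index[term.lower()] = ("related", confidence)
        return index


# Example terms come from the text; the 0.9 confidence is invented.
oa = DiseaseProfile(names={"osteoarthritis", "OA"},
                    superclasses={"arthritis"})
cad = DiseaseProfile(names={"atherosclerotic CV disease", "CAD"},
                     related_terms={"stent placement": 0.9})
oa_index = oa.term_index()
cad_index = cad.term_index()
```

Keeping the term source alongside each term makes it possible to treat, for example, a super-class mention as weaker evidence than a direct disease name when predictions are later revised.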
All clinical terms collected were assigned confidence levels taking into account the quality of the prediction results obtained from the training dataset (available as an online data supplement at www.jamia.org). Initially, the sentences that contained any term from the lexical profile were labeled with Y; in the subsequent steps, this evidence was challenged and potentially revised to N, Q, or U based on the context in which the terms were used. The sentence-based predictions were then combined at the document level. The four processing steps in this module are described briefly below (further details are given in the online supplement).
Step T1: Term matching. To cater for terminological variation, terms that characterize a disease were matched against the text approximately, taking into account morphological variants and, if necessary, ignoring word order and tolerating the distance between the words within a term (e.g., both “stent placement” and “placement of coronary stent” referred to the same treatment for CAD).
Step T2: Sentence filtering. Sentences that did not mention a disease-related term were filtered out. We also discarded sentences from the sections deemed less important for the textual task (namely “Social/Family History” and “Other”), sentences that potentially referred to family members, and
Table 1. Examples of Disease-Status Lexico-Semantic
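The approximate matching described in Step T1 can be sketched as a windowed bag-of-words comparison. This is a minimal illustration, not the authors' implementation: the plural-stripping `normalize` function is a crude stand-in for real morphological normalization, and the `max_extra_tokens` parameter and its default are assumptions.

```python
import re
from collections import Counter


def normalize(token):
    """Crude stand-in for morphological normalization (an assumption):
    lowercase the token and strip a simple plural 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def tokenize(text):
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


def term_matches(term, sentence, max_extra_tokens=2):
    """True if all words of `term` occur, in any order, within a window of
    the sentence that holds at most `max_extra_tokens` other tokens."""
    needed = Counter(tokenize(term))
    if not needed:
        return False
    tokens = tokenize(sentence)
    window = sum(needed.values()) + max_extra_tokens
    for start in range(len(tokens)):
        seen = Counter(tokens[start:start + window])
        if all(seen[word] >= count for word, count in needed.items()):
            return True
    return False


# The two surface forms quoted in Step T1 both match the same term.
hit_direct = term_matches("stent placement", "Status post stent placement.")
hit_reordered = term_matches(
    "stent placement", "Patient underwent placement of coronary stents.")
miss = term_matches("stent placement", "No chest pain on exertion.")
```

Widening `max_extra_tokens` tolerates more intervening words at the cost of more spurious matches, which is one reason the matched sentences are filtered and re-examined in later steps.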